Interactive Multi-objective Reinforcement Learning in Multi-armed Bandits with Gaussian Process Utility Models

Authors

Abstract

In interactive multi-objective reinforcement learning (MORL), an agent has to simultaneously learn about the environment and the preferences of the user, in order to quickly zoom in on those decisions that are likely to be preferred by the user. In this paper we study interactive MORL in the context of multi-armed bandits. Contrary to earlier approaches that force the utility of the user to be expressed as a weighted sum of the values for each objective, we do not make such stringent a priori assumptions. Specifically, we not only allow non-linear preferences, but also obviate the need to specify the exact model class to which the utility function must belong. To achieve this, we propose a new approach called Gaussian-process Utility Thompson Sampling (GUTS). GUTS employs parameterless Bayesian learning to accommodate any type of utility function, exploits monotonicity information, and limits the number of queries posed to the user by ensuring that questions are statistically significant. We show empirically that the regret and the number of user queries can be made highly sub-linear in the number of arm pulls. (A preliminary version of this work was presented at the ALA workshop in 2018 [].)
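To make the loop concrete, below is a minimal sketch of a GUTS-style iteration, assuming Gaussian reward posteriors, an RBF-kernel GP over utilities, and direct scalar utility ratings in place of the paper's user queries. The environment, `true_utility`, and the fixed query schedule are illustrative assumptions, and the paper's monotonicity and significance-testing machinery is omitted.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
n_arms, n_obj, horizon = 5, 2, 200
true_means = rng.uniform(size=(n_arms, n_obj))      # hidden bandit parameters

def true_utility(v):
    return v[0] * v[1]                              # hypothetical non-linear user

pulls = [[] for _ in range(n_arms)]                 # observed reward vectors
rated_vectors, ratings = [], []                     # utility feedback gathered
gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), alpha=1e-2)

for t in range(horizon):
    # 1. Thompson-sample a mean reward vector per arm from a Gaussian posterior.
    sampled = np.empty((n_arms, n_obj))
    for a in range(n_arms):
        n = len(pulls[a])
        mean = np.mean(pulls[a], axis=0) if n else np.full(n_obj, 0.5)
        sampled[a] = rng.normal(mean, 1.0 / np.sqrt(n + 1))

    # 2. Thompson-sample a utility function from the GP posterior over ratings.
    if rated_vectors:
        gp.fit(np.array(rated_vectors), np.array(ratings))
        util = gp.sample_y(sampled, random_state=int(rng.integers(1 << 31))).ravel()
    else:
        util = sampled.sum(axis=1)                  # fall back to a linear prior

    # 3. Pull the arm whose sampled reward vector maximises the sampled utility.
    arm = int(np.argmax(util))
    reward = rng.normal(true_means[arm], 0.1)
    pulls[arm].append(reward)

    # 4. Periodically query the user about the observed vector (fixed schedule
    #    here; GUTS instead decides via statistical-significance tests).
    if t % 20 == 0:
        rated_vectors.append(reward)
        ratings.append(true_utility(reward))
```

The key point the sketch tries to capture is that both sources of uncertainty, the mean reward vectors and the utility function itself, are resolved by posterior sampling within the same step.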


Similar resources

Interactive Thompson Sampling for Multi-objective Multi-armed Bandits

In multi-objective reinforcement learning (MORL), much attention is paid to generating optimal solution sets for unknown utility functions of users, based on the stochastic reward vectors only. In online MORL, on the other hand, the agent will often be able to elicit preferences from the user, enabling it to learn about the utility function of its user directly. In this paper, we study online MO...

Multi-Objective X-Armed Bandits

Many of the standard optimization algorithms focus on optimizing a single, scalar feedback signal. However, real-life optimization problems often require the simultaneous optimization of more than one objective. In this paper, we propose a multi-objective extension to the standard X-armed bandit problem. As the feedback signal is now vector-valued, the goal of the agent is to sample actions in t...

Thompson Sampling for Multi-Objective Multi-Armed Bandits Problem

The multi-objective multi-armed bandit (MOMAB) problem is a sequential decision process with stochastic rewards. Each arm generates a vector of rewards instead of a single scalar reward. Moreover, these multiple rewards might be conflicting. The MOMAB problem has a set of Pareto-optimal arms, and an agent's goal is not only to find that set but also to play the arms in that set evenly or fairly...
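As a concrete illustration of the Pareto-optimal arm set mentioned above, the following generic dominance check (not the cited paper's algorithm) recovers the non-dominated arms from estimated mean reward vectors:

```python
import numpy as np

def dominates(u, v):
    """True if u Pareto-dominates v: at least as good everywhere, better once."""
    return np.all(u >= v) and np.any(u > v)

def pareto_arms(means):
    """Indices of arms whose mean reward vectors are non-dominated."""
    return [i for i, v in enumerate(means)
            if not any(dominates(u, v) for j, u in enumerate(means) if j != i)]

means = np.array([[0.8, 0.2], [0.5, 0.5], [0.3, 0.9], [0.4, 0.4]])
print(pareto_arms(means))  # [0, 1, 2] -- arm 3 is dominated by arm 1
```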

Active Learning in Multi-armed Bandits

We consider the problem of actively learning the mean values of distributions associated with a finite number of options (arms). The decision maker can select which option to generate the next sample from, the goal being to produce estimates with equally good precision for all the options. If sample means are used to estimate the unknown values, then the optimal solution, assuming full knowledge...
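A minimal sketch of the allocation idea described above, assuming Gaussian rewards: greedily sample the arm whose sample mean currently has the largest estimated variance σ̂²/n. This plug-in rule is an illustrative stand-in, not the cited paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)
true_sd = np.array([0.2, 1.0, 0.5])                 # unknown per-arm noise levels
samples = [[rng.normal(0.0, s)] for s in true_sd]   # one seed sample per arm

for _ in range(300):
    # Estimated variance of each sample mean: sigma_hat^2 / n.
    se = [np.var(s, ddof=1) / len(s) if len(s) > 1 else np.inf for s in samples]
    arm = int(np.argmax(se))                        # most uncertain estimate
    samples[arm].append(rng.normal(0.0, true_sd[arm]))

print([len(s) for s in samples])  # counts grow roughly with sigma^2
```

Noisier arms receive proportionally more samples, which is what equalizing the precision of all mean estimates requires.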

Exploiting Similarity Information in Reinforcement Learning - Similarity Models for Multi-Armed Bandits and MDPs

This paper considers reinforcement learning problems with additional similarity information. We start with the simple setting of multi-armed bandits, in which the learner knows the color of each arm, and it is assumed that arms of the same color have close mean rewards. An algorithm is presented showing that this color information can be used to improve the dependency of online regret boun...
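One way to picture the color assumption is a UCB-style index that pools same-color samples, paying an assumed within-color gap bound `eps` in exchange for a tighter confidence width. This is an illustrative sketch, not the algorithm from the cited paper:

```python
import numpy as np

rng = np.random.default_rng(2)
colors = np.array([0, 0, 1, 1, 1])                      # known color per arm
true_means = np.array([0.50, 0.55, 0.20, 0.25, 0.22])   # same color => close means
eps = 0.1                                               # assumed within-color bound
counts, sums = np.zeros(5), np.zeros(5)

for t in range(1, 2001):
    ucb = np.empty(5)
    for a in range(5):
        grp = colors == colors[a]
        n_a, n_grp = counts[a], counts[grp].sum()
        # Individual bound from arm a's own samples.
        ind = sums[a] / n_a + np.sqrt(2 * np.log(t) / n_a) if n_a else np.inf
        # Pooled bound: same-color means lie within eps of arm a's mean.
        pool = (sums[grp].sum() / n_grp + eps + np.sqrt(2 * np.log(t) / n_grp)
                if n_grp else np.inf)
        ucb[a] = min(ind, pool)                          # take the tighter bound
    a = int(np.argmax(ucb))
    counts[a] += 1
    sums[a] += rng.normal(true_means[a], 0.3)

print(counts)  # pulls concentrate on the better (color-0) arms
```

Because an unexplored arm inherits a finite pooled bound from its color group, the learner need not pull every arm individually before ruling a poor group out.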


Journal

Journal title: Lecture Notes in Computer Science

Year: 2021

ISSN: 1611-3349, 0302-9743

DOI: https://doi.org/10.1007/978-3-030-67664-3_28